On the Role of Decision Theory 
in Uncertainty Analysis 

Merlin Keller 1 ' 2 , Eric Parent 2 , Alberto Pasanisi 1 , 

1 EDF R&D, Chatou, France 
2 AgroParisTech, Paris, France 



Abstract. Maximum likelihood estimation (MLE) and heuristic 
predictive estimation (HPE) are two widely used approaches in in- 
C^h dustrial uncertainty analysis. We review them from the point of view 

^ of decision theory, using Bayesian inference as a gold standard for 

comparison. The main drawback of MLE is that it may fail to prop- 
el erly account for the uncertainty on the physical process generating 

the data, especially when only a small amount of data are available. 
HPE offers an improvement in that it takes this uncertainty into ac- 
P | count. However, we show that this approach is actually equivalent 

to Baycs estimation for a particular cost function that is not explic- 
! t itly chosen by the decision maker. This may produce results that 

C$ are suboptimal from a decisional perspective. These results plead 

for a systematic use of Bayes estimators based on carefully defined 
cost functions. 

' ^ Key words and phrases: Uncertainty analysis; decision theory; epis- 

temic uncertainty; Bayes estimation; maximum likelihood estima- 
tion; heuristic predictive estimation. 
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1 Introduction 



More and more applied industrial studies involve nowadays explicit uncertainty 
assessment [20, 31, 12]. Indeed, in industrial practice, engineers coping with 
quantitative predictions using computer models must face numerous and quite 
different sources of uncertainty, which affect the results of their studies. In most 
practical problems, the objective is to study the probability distribution of the 
output of a deterministic model, the inputs of which are random variables. 

Uncertainty on the model, due for instance to the prohibitive computational 
cost its implementation may require, or to its discrepancy with the physical 
system it represents, can be treated in this framework, which is sometimes 
called Uncertainty Analysis [24]. However, for simplicity we do not tackle such 
sources of uncertainties in what follows. Furthermore, the model is assumed to 
provide deterministic outputs for each given set of inputs, that is, to each input 
value corresponds a unique output value. 

Within this paper, we will voluntarily stay within this framework, which is 
rather common in the industrial practice [12]. The ingredients of the problem 
are then: 

- a preexisting physical or industrial system, represented by a deterministic 
model G(-); 

- inputs of the physical model, affected by various sources of uncertainty, 
and modeled jointly as a random vector, noted X in the following. Hence 
we adopt probability theory as a means of quantifying uncertainties, and 
do not consider here the use of other tools, such as fuzzy logic [2, 41], 
Dempster-Shafer Theory [35, 40, 7], etc. 

- outputs of the physical model, i.e., deterministic functions of the inputs, 
which we note in the following: Y = G(X). In the following, we will limit 
ourselves for simplicity to scalar outputs, though our arguments can be 
extended to multivariate outputs; 

- an amount of data and expertise available to assess input uncertainty, 

- industrial stakes that motivate the uncertainty assessment more or less ex- 
plicitly (safety and security, environmental control, resources optimization 
etc.). 

The industrial stakes guide the choice of one or more particular quantities 
of interest summarizing the main results of the uncertainty analysis, e.g. the 
mean and/or a given quantile of the model output. 

Note that this framework also includes the case where the deterministic 
model G(-) is the identity function, i.e. the classical problem of predicting par- 
ticular features of the probability distribution of an observed random variable. 

Whatever the complexity of the pre-existing model (the identity function 
or a time-consuming computer model), apart from computational issues, the 
problem is, at least conceptually, the same: given some information about an 
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observable variable, predict some quantities of interest either of its probability 
distribution or of a deterministic function of it. The problem is then statistical 
in essence and involves the two classical steps of statistical analysis: inference 
(induction) and prediction (deduction) [18]. 

In the uncertainty analysis framework, the engineer takes (or not) explicitly 
into account the uncertainty affecting the estimation of the parameters of the 
distribution function of model inputs, by adding (or not) an additional uncer- 
tainty layer. In some cases, he/she often ignores this uncertainty and simply 
considers that the chosen inference method, such as maximum likelihood estima- 
tion (MLE), provides the exact value of the parameters. Although other estima- 
tion techniques such as the moments or weighted moments, pseudo-likelihood, 
etc. can be considered, this paper emphasizes MLE as it is by far the most 
popular. The propagation of the probability distribution of the inputs through 
the deterministic model leads thus to a distribution of outcomes, as if the true 
value of the parameter were the MLE. 

In some other cases, the uncertainty affecting the value of the parameters is 
modeled by using a proper probability distribution function. If the statistical 
analysis has been done in the Bayesian framework, this distribution function is 
the parameter's posterior distribution; otherwise the distribution of a particular 
estimator (most of the time the normal asymptotic approximation of the MLE) 
can be used [12]. 

Once this additional uncertainty level has been added, the resulting mixture 
probability distribution function (pdf) for X is propagated through the deter- 
ministic model, for instance using the double monte-Carlo approach. This yields 
a distribution of the model output, which is known as the posterior predictive 
distribution if the chosen setting is Bayesian. It is tempting to conclude [1] that 
this predictive distribution summarizes all the sources of uncertainty affecting 
this output, and to evaluate the quantity of interest directly using this predic- 
tive distribution. For instance, if one is interested in a given quantile of the 
output, then, according to a common practice, one could often evaluate it as 
the corresponding quantile of the predictive distribution. 

This heuristic approach, in which essentially the inference problem is sepa- 
rated from the prediction phase of the statistical analysis, can lead to erroneous 
results as it is not coherent with statistical decision theory, as we will show in 
the following. The main purpose of this paper is to put standard engineering 
practices into a more formal framework so as to point out the (more or less) 
hidden hypotheses underlying the heuristic approach. Avoiding any dogmatic 
temptation, this paper aims to let the uncertainty practitioner be aware of these 
underlying hypotheses, to let he/she decide if they could be accepted or not, 
given the particular context of the industrial study. 

The rest of this paper is organized as follows. We start in Section 2, by 
recalling the basic principles of decision theory and Bayes estimation. This ap- 
proach, known to be optimal from a decisional viewpoint, constitutes a gold 
standard to which other approaches may be compared. In Section 3, we discuss 
the relevance of MLE, whose justification relies on large sample properties, in 
industrial risk problems where the data are often scarce. The heuristic use of the 
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predictive distribution briefly sketched above is reviewed in Section 4. We show 
that the use of this seemingly generic heuristic is equivalent to Bayes estimation, 
and induces implicitly the choice of a particular cost function that depends on 
the mathematical expression of the chosen quantity. Section 5 compares all re- 
viewed approaches on a dyke reliability assessment problem toy example. These 
results and their implications for industrial uncertainty analysis are discussed 
in Section 6. 

2 Decision theory and Bayes estimators 
2.1 General problem formulation 

In the following, the model inputs will be modeled jointly as a random vec- 
tor X, distributed according to an unknown probability distribution function 
F. However, instead of dealing with such a functional uncertainty, we restrict 
ourselves to look for a distribution belonging to a parametric family, admitting 
a pdf f{-\9), specified by an unknown parameter vector 9. 

The model output is noted Y = G(X), and its pdf is noted p(-\9) in the 
following. We additionally assume that the physical process represented by G 
is not random. When G is invertible, the output's pdf is thus given by the 
formula: p(y\0) = /(C?- 1 (y) |^)/|G" (G" 1 (y)) | . 

We also assume, as is in general the case, that the data are a sample D = 
(x\, . . . , x n ) of input values, that is, independent realizations of the variable X. 
We note £(D\6) = Uf =1 f(xi\9) the likelihood of the data vector. Based on this 
model and the data D, our goal is twofold: 

i. Infer the law of X, i.e. in our case estimate 9; 

ii. Deduce an estimate of the pdf of Y, and of a certain quantity of interest 
(mean, quantile, etc.) of this law, that is, a certain function <fi = 4>(F) of 
the inputs' distribution. In the parametric framework we have adopted, 
this reduces to a function <fr(9) of the model parameters. 

Illustrative example Imagine we wish to assess the lifetime X of an in- 
dustrial component, based on the previously recorded lifetimes x\.....x n of 
identical components, generated by the same production chain. As is a classical 
approach in survival analysis, we postulate that D — {xi, . . . ,x n ) is a sample 
from the exponential distribution £(6), with pdf f(x\9) = (l/9)e~ x ^ e . Here the 
parameter 9 is seen to be the average lifetime, that is, E[X|#] = 9. Note that we 
have no physical model here, so that G(-) reduces to identity, and Y = X. 

Once this model has been established, we can compute quantities of interest, 
for instance the reliability of the component, that is, the probability that it lasts 
more than t units of time, which we can write as: </> = P[X > t\9] — e~ l l e . If 
the component under study is a light bulb for instance, t = 2 years corresponds 
to standard regulations in order to ensure some acceptable working life period. 
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In practice, (ft, like 9, is unknown but we can use the data D to estimate 
this quantity of interest by (ft — <p(xi, . . . , x n ). If this estimated reliability is 
judged too small, then the production chain might be stopped, or certain of the 
machines involved in the process may have to be fixed or changed, etc. 

For example, if (ft < 1% then the production will be discarded, if 1% < (ft < 
15% then the production will be sold at a lower price with reduced guarantee; 
if (ft > 15% then the production will be released at the ordinary price with the 
usual guarantee. 

In other words, the mapping: 

(xi,...,x n ) i ^ ^ = $(xt,...,x n ) 

allows for a decision rule between what is known from the piece of information 
carried by the sample D and the possible final decisions that the industry man- 
ager can make, such as: {discard, sell at a lower price, release} in the previous 
example. By extension, we will call a possible value d for (ft a 'decision', even 
though the line of reasoning should go one step further to reach the operational 
decision. 

2.2 Cost function 

As illustrated above, in most industrial studies, the estimated value (ft of the 
quantity of interest is used to take a certain decision. Because each decision has 
a context-dependent cost, intuitively the aim of the predictioner is to choose 
the value of (ft which minimizes the anticipated value of the cost. 

This notion of prediction is formalized in the context of decision theory by 
the concept of loss or cost function, that is, a certain function C((ft,d) that 
measures the cost resulting from decision d when the quantity of interest is (ft. 
Here the 'decision' d is simply the chosen value for the unknown quantity (ft. 

To continue with the previous example, imagine that the true (ft is 2.1% 
and, due to the imperfect partial information we choose d = 75%. In other 
words we did not see that the situation was out of control ((ft = 2.1%) and 
keep business as usual (d — 75% therefore release the production in ordinary 
conditions). This error may damage the company's reputation and lead to the 
loss of market shares, quantified by the cost C((f> — 2.1%, d — 75%). This cost 
may be substantially higher than the cost C((ft = 60%, d — 10%), associated to 
a false alarm situation in the light bulb industry. 

However, the precise expression of the cost function is not always available to 
the engineer in charge of estimating (ft, either because it has not been provided 
explicitly by the decision maker, or because the exact costs are hard to evalu- 
ate. In such reasonable expression for C((ft,d) must be chosen, which 
shares the main features of the real cost function, according to the available 
information. 

It is usually required that this 'surrogate' cost function respects some rea- 
sonable properties. For instance, the cost is usually chosen to be convex and 
minimal when (ft is matched by the decision d : min^ C((ft, d) = C((ft, (ft) [37]. 
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Figure 1: Cost functions for the estimation of a tail probability. Continuous line: 
quadratic loss, dashed line: log-quadratic loss, dash-dotted line: weighted absolute 
loss, with C2 = 10 x CI. The vertical line corresponds to the true value <j>. Cost 
functions are normalized for viewing convenience. 

A useful concept, closely related to that of cost function, is the concept of 
regret function, also termed opportunity loss [33] . It is defined as the difference 
between the cost of decision d and the cost of the optimal decision 4> ■ 

R e (^d) = c(«M) -<?(<M)- 

In other terms, R e (cf>,d) is the additional cost one has to pay when taking a 
decision d different from the optimum. It can also be interpreted as a measure of 
information value, since it is the maximum price the decision maker is willing to 
pay to learn the optimal decision <j> [23]. Arguably, losses more naturally describe 
real-life problems than cost functions, and are therefore easier to define, when 
the true costs are not known precisely, as illustrated in the following examples. 

Usual loss functions A popular choice of cost function is the quadratic loss, 
defined by 

C(<j>,d) = C x(d>-d) 2 . (1) 

Its use implies that the cost is quadratic in the error on <f>, and does not depend 
on its sign. It can be derived as a second-order approximation of the 'true' cost 
function, Co representing a constant marginal cost. Note that according to this 
justification, one should also include the minimal cost term C TO , corresponding 
to the optimal decision <f>. Hence (1) is in fact more readily interpretable as a 
regret than truly a cost function. 
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Often used as a default choice, the quadratic loss is however inappropriate 
if the cost depends on whether </> is over- or under-estimated. For instance, as 
illustrated above, the failure of an industrial component may have much more 
costly consequences than those resulting from stopping or fixing the production 
chain. In this case an asymmetric cost function might be preferable, such as the 
weighted absolute loss: 

C(<M) = |^-d|x (Cil {d<0} + C 2 l {d><A} ). (2) 

This loss signifies that the additional cost of decision d is proportional to its 
absolute deviation from </>, multiplied by a different factor depending on wether d 
over- or under-estimates 4>. This loss function is interesting from an operational 
point of view, in that it allows the user to specify precisely what costs are 
involved for each type of error. 

Though the above cost functions are the most common, many others have 
been used in decision-oriented estimation problems, such as the a— absolute loss 
[10], the LINEX [39] or the entropy loss [32]. Going back to the problem of tail 
probability estimation, another criticism that can be done about the use of the 
quadratic loss is that in general we are less interested in the probability itself 
than by its order of magnitude, that is, its logarithm. Thus, we might consider 
using the log-quadratic loss: 

C(<M) = (lo g 0-logd) 2 . (3) 

All three losses are illustrated in Figure 1, in the case where <fi is a tail probability. 

2.3 Bayesian approach 

Minimizing the loss C(4>, d), or equivalently the regret R e (<fi, d), is impossible in 
practice because <fi, like 9 from which it is derived, is unknown (otherwise, in a 
perfect information situation, we would have the trivial solution d = cf>). Instead, 
the Bayesian approach consists in describing the uncertainty on the unknown 
parameter 9 by a prior probability density it (9), specifying the distribution of 
all possible values of 9. This prior information is updated using the data to yield 
the posterior distribution ir(9\D), according to Bayes' rule: 

ir(9\D) cx £(D\9)ir(9). (4) 

The optimal decision in this setting is obtained by minimizing the expected 
posterior opportunity loss, E[R e ((f>, d)\D], or equivalently, the expected posterior 
loss: 

E[C(<f>,d)\D] = / C(<t>(0),d)Tr(0\D)dO. (5) 
Je 

As stressed in [13], this procedure can be interpreted in terms of an inte- 
grated sensitivity analysis, where each possible cost resulting from decision d 
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Figure 2: Bayes estimation as an integrated sensitivity analysis. The Bayes estimate 
is the decision d (dashed line) that minimizes the cost associated with the difference 
(shaded area) between the quantity of interest (j> (thick line) and the decision d, pref- 
erentially in regions of high posterior density n(9\D) (thin line). In this particular 
example, is a tail probability in an exponential model, as explained in Section 2.1. 



is weighted according to the posterior probability of such a cost, evaluated con- 
ditionally on the available data. This is illustrated in Figure 2. Considered as a 
function of the sample D, the decision that minimizes this quantity is called the 
Bayes estimator, relative to the prior ir(9) and the loss C(-, •). In the following, 
we will note this statistic: 4> BAY = 4> BAY (D). 

Besides being intuitive, Bayes estimators are also known to have remarkable 
properties. Most importantly, they are systematically admissible, meaning that 
it is impossible to find an estimator with a uniformly lower risk [17], the risk 
of a decision rule 5 (with d = S(D)) being defined as the average loss over all 
possible datasets, conditional on the true parameter value 9 : 

TZ(m,S) := I C(m,5(D))£(D\6)dD. (6) 
Jd 

Thus, for a given estimation problem, once a cost function has been defined, it is 
in principle possible to consider only Bayes estimators as reasonable candidates, 
since all other rules are potentially inadmissible, under regularity conditions on 
the likelihood and the cost function [3]. Additionally, once a prior distribution 
on 8 has been specified, a unique estimator is constructed pointwise (i.e. for each 
sample of information), by minimizing (5), which is equivalent to minimizing 
the Bayes risk, 
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R n (S) := [ K(<l>(O),6)n{9)d0, 
Je 



(7) 



over all the functional space of the decision rules S [37]. This means that the 
choice of an optimal estimator, in the framework of decision theory, is uniquely 
determined by the problem specification (the cost function), and the amount of 
prior information available on the uncertain parameter. 



2.4 Bayes estimates for simple cost functions 

In general, minimizing the posterior expected loss (5) may be difficult, depend- 
ing on the choice of a loss function and prior distribution. However, the ele- 
mentary cost functions mentioned earlier yield simple expressions for the corre- 
sponding Bayes estimate, which may explain partly their popularity. 

Quadratic loss It is straightforward to see that the Bayes estimate for the 
quadratic loss (1) is the posterior mean: 

BAY = E[<P\D] 

4>{6)-K{d\D)d6. 



Weighted absolute loss Under the weighted absolute loss (2), calculations 
are more involved. In [13], it is shown that the resulting Bayes estimate B ay is 
such that: 

n<t><j BAY \D] = j^r- 

In other terms, if C ^ C2 — 95% for instance, then 4> BAY is the 95-th percentile 
of the posterior density of the quantity of interest <f>. In the special case where 
C\ = C2, the optimal decision is seen to be the posterior median. 



3 MLE approach 

We now discuss the properties of the maximum likelihood estimation (MLE) 
approach, which is by far the most popular method of estimation currently 
used. The MLE of model parameters 6, noted (9 MLE in the following, is simply 
the value from which the data are the most likely to have arisen: 



MLE := argmax£(£>|6»). 
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This quantity is then substituted to the unknown value of 9 in the definition of 
4>, yielding the so-called 'plug-in' estimate: 

0mle := 0(0mle)- (8) 

A well-known property of the MLE is that it is invariant by reparameter- 
ization, so that if the model parameter 9 is replaced by v = ip{6), where tp 
is one-to-one, then 9 MLE = ip(9 MhE ), and the value of MLE remains unchanged. 
Note that this is not the case in general for Bayes estimates, which depend on 
the chosen parameterization. 



3.1 Asymptotic properties 

The MLE is engagingly simple compared to the Bayes estimators described in 
the previous sections, as it requires no specification of a cost function C or a 
prior distribution n. Furthermore, its use can be justified by its good asymptotic 
behavior (see [38] for instance). First, the MLE is known to be consistent under 
very mild assumptions. This means that, for a sufficiently large number n of 
observations, 9 MLE becomes arbitrarily close to the true value 9. The same result 
thus applies to </> M le, which tends to the true value <p when n goes to infinity. 

Secondly, under regularity conditions of the likelihood function C(D\6), the 
MLE is asymptotically normal, so that for n 'large enough' (in practice, n = 40 is 
considered sufficient in most applications), its law (conditional on the unknown 
value of 9) can be approximated by: 

&$uJ\0) « AA^.ix-^X^w), (9) 

where 1(9) := E (A \og£(D\9)) 2 \9 is the Fisher information matrix, the 
expectation being taken with respect to the data vector D. This last result 
allows to evaluate the standard error of the MLE and tells us that it decreases 
as 1/y/n. 

However, note that the asymptotic variance in (9) is not directly available, 
since it depends on 9, which is unknown. In practice, it must be estimated, 
for instance by its MLE \T-~ X (9 ULE )((j)' (9 MhE )) 2 . Furthermore, these good large 
sample properties do not constitute a sufficient reason to choose MLE over 
Bayes estimates, because the latter also behave satisfyingly in presence of many 
data. Indeed, Doob's theorem ensures that, under mild conditions regarding the 
prior, the posterior distribution concentrates around the true parameter value 
38]. Furthermore, under essentially the same regularity assumptions as for the 
MLE, the posterior distribution of <j> can be well approximated by : 



&{<t>\D) « M \4>uuz, ^2I~ 1 (^mle)(^ / (^mle)) 2 J , 



(10) 



as shown for instance in [.'->]. Despite the striking similarity between (10) and 
(9), it must be stressed that they have in fact fundamentally different meanings. 
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Indeed, (9) describes the variability of (j) MLE as a function of the random sample 
D, conditional on the unknown value of the parameter 9. In contrast, (10) 
refers to the posterior distribution of plausible values for the unknown quantity 
(j), conditional on the observed data D. 

The main point of this comparison is that, when many observations are avail- 
able, MLE and Bayes estimates become indistinguishable, whatever the choice 
of a prior and cost function, as the posterior variance of 4> around </> MLE tends 
to zero. Thus, in this case there is no reason of preferring one approach over 
the other, though one may argue that the MLE has the advantage of simplicity, 
since it does not require specifying a cost function, nor a prior distribution. 

3.2 Small sample comparison 

In industrial studies however, more than often very few observations are at our 
disposal. This is especially the case when predicting the occurrence of some 
extreme event (such as the loss of a plane's steering mechanism), for which by 
definition very few observations have fortunately been made. In such cases the 
MLE looses all theoretical justification, and may perform very badly for a given 
loss function when compared to the Bayes estimate, which remains optimal in 
the sense described in Section 2.3. The MLE may even be inadmissible, as in 
the classical example of estimating normal means in dimension greater than 2 
] , or when estimating the shape parameter of a Weibull distribution [28] . In 
these cases, its use must thus be avoided. 

Intuitively, the main reason for this poor behavior is due to the fact the MLE 
neglects the uncertainty on the parameter 9 by assimilating it to its most likely 
value. However, especially for small values of n 1 ^mle may be unstable with 
respect to variations of the data, and take significantly different values from <fi. 
This is precisely the issue addressed by the Bayesian approach, since instead of 
considering the loss in just one point, it averages it across all 'probable' values 
of 9, given the information provided by both the prior and the data. This yields 
a more robust estimator (in the frequentist sense), the more so at small samples, 
since then the influence of the prior is non-negligible and tends to counterbalance 
sample variability, that is, variations in the observed variables D. 

4 Heuristic predictive approach 
4.1 Principle 

As we have seen in Section 1, a common way to account for the overall un- 
certainty on both observable and non observable quantities is to replace the 
unknown pdf p(y\9) by the posterior predictive density: 



Jo 

that is, to average the pdf over all possible values of 9, weighted by their prob- 
ability given the information provided by the data D. Here the output Y is 




(11) 
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Figure 3: Density of the 7V(10, 1) distribution (continuous line) and corresponding 
posterior predictive distribution (dashed line), estimated from n = 10 observations 
(dots), with a normal-inverse Gamma prior. Vertical segments correspond to the 5-th 
and 95-th percentiles of each density. 

interpreted as a future observation we wish to predict. For instance, in the sur- 
vival analysis example introduced in Section 2, Y could be seen as the lifetime of 
the (n + l)-th industrial component, to be predicted based on the n previously 
observed lifetimes. As illustrated in Figure 3, the predictive density is typically 
more spread out than the true density, because uncertainty on 9 has been added 
to the variability of Y. 

A seemingly natural use of the predictive density is to estimate any char- 
acteristic quantity of the (unknown) density p(y\9), such as expected values, 
tail probabilities, quantiles, etc., by the corresponding characteristic quantity of 
p{y\D) [16, 9]. Indeed, this provides a generic procedure to estimate any quan- 
tity of interest <f>. Furthermore, this procedure has the advantage of accounting 
for the uncertainty on 9, having weighted each of its possible values according to 
their probability given the data. In the following, we will refer to this estimate 
as the heuristic predictive estimator (HPE). 

Variants of the posterior predictive p(y\D) are also commonly used, repre- 
senting the uncertainty on 9 by other densities than its posterior. These include: 

- The prior density tt(9) (see [21, 25] for instance). This yields the prior 
predictive density: 



Je 

which is thus defined independently from the data D, or more precisely 




(12) 
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with D — {}, that is, in absence of data. In this case, inference on the 
output Y is purely based on prior, or expert, knowledge concerning the 
physical process generating the data. This is the case in [21], where the 
future performance of a waste disposal plant is assessed based solely on 
the outputs of computer programs. These programs encode the experts' 
view of the plant's functioning and environment. 

- The normal p{9) = N ( MLE , — X^ 1 (9 MLE ) ] as suggested in [] ]. From 

V n ) 

Section 3.1, this can be seen as the asymptotic approximation (9) of the 
sample distribution of UIM (conditional on 9), estimated by MLE. 

However this interpretation is not coherent with the definition (11) of the 
predictive density, which requires the distribution of possible values of 9 
to be conditional on the data D; a more appropriate interpretation would 
be to consider the above Gaussian as the asymptotic approximation (10) 
to the posterior distribution ir(9\ D). Thus it may be used as a convenient 
substitute when a large number of observations are available. 

But, we have already stressed that large datasets are rather exceptional 
in real-life industrial studies. Hence for small samples, other types of pos- 
terior density approximations are required, such as provided by stochastic 
sampling, e.g. Monte-Carlo Markov Chain (MCMC), techniques. 

A practical implementation of HPE is the double Monte-Carlo algorithm. It 
consists in generating random values 9\ , . . . , 9i from the posterior density n(9\D) 
(or one of the above-mentioned alternatives), than simulating for each i — 
1, . . . , I one output t/j from the conditional density p(yi\0i)- This generates a 
sample j/i , . . . , j/j from the predictive density (11), from which one can eas- 
ily construct Monte-Carlo estimates of any desired quantity. Thus the double 
Monte-Carlo algorithm is essentially a technique to numerically integrate the 
joint posterior density of (0,Y). 

4.2 Interpreting the predictive density 

As described above, the principle of HPE is to replace the sample density p(y\9), 
representing the true physical phenomena under study, and available only in a 
perfect information setting, by the predictive density p(y\D), which accounts 
for the uncertainty on 9 when it is not known precisely. This use of p(y\D) 
as a surrogate for p{y\9) suggests that both densities are, to a certain point, 
exchangeable. The goal of this section is to stress that, though very similar in 
their mathematical expressions, these distributions have in fact very different 
meanings, and should not be mistaken for one another. 
To see this, consider the following toy examples: 

1. Coin tossing. Suppose a well balanced coin is flipped 8 times, and each 
flip is recorded (1 for 'heads', for 'tails'). Any sequence of 8 binary digits 
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is equally probable, so the result of this experience might be: 

1110 1111. (13) 

If asked what are the chances that the next throw will land on 'heads', the 
obvious answer is 1/2, since we now that the coin is well-balanced. Note 
that this future throw is totally independent from the previously recorded 
ones. 

2. Light bulbs, revisited. Now assume light bulbs coming out from a 
production chain are tested, and their status reported (1 for a correctly 
working bulb, for a faulty one). Suppose the the 8 first light bulbs lead 
to the following sequence: 

11101111. 

If asked what are the chances that the 9-th light bulb will work correctly, a 
natural answer would be 7/8, based on the fact that 7 out of the 8 previous 
bulbs where working. Note that here the status of the future light bulb is 
considered dependent on that of the previously observed ones. 

Interpreting the above examples using the notations introduced in Sec- 
tion 2.1, in both cases the data D, equal to (13), consists in n = 8 independent 
realizations of a Bernoulli variable X, with P[X = 1|0] = 9. The quantity of 
interest <p is the probability that a future realization Y of the same law is equal 
to 1, that is, (/>(9) = P[Y = l\6] = 9. 

The only difference between both examples is that in the first case 9 is known. 
Hence, all statements are made conditional on 9, the future coin flip Y is in- 
dependent from the previous ones. This is an example of perfect information, 
because the distribution f(y\9) of the future observation is available; the un- 
certainty it describes is purely aleatory, in that it corresponds to the variability 
associated to a natural phenomenon (a coin toss). 

In contrast, in example 2 the proportion 9 of well functioning light-bulbs 
is not known in advance, so it must be estimated from a sample D of the 
production chain. Hence in that case the status Y of the future light bulb is not 
independent from the previously recorder ones. This is an example of imperfect 
information, the distribution of the future observation f(y\9) being unavailable. 
The predictive distribution f(y\D) can be used instead; in fact, 7/8 is easily 
seen to be the HPE of P[Y = 1\9], relative to the improper beta prior B(0, 0). 

Note that the uncertainty described by f(y\D) mixes the aleatory uncertainty 
associated to Y with the epistemic uncertainty on the parameter value. As 
such, it looses all phenomenologic interpretation. Rather, the predictive density 
represents one's current state of knowledge concerning the possible values of Y. 
After all, as stated in [29], 'Bayesian theory only describes how an individual's 
uncertainties are formed and modified by evidence'. 

In conclusion, in both examples, the same terminology is used to talk about 
'the probability that the future observation Y equals 1'. However, in the first 
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case, this refers to the true probability of a natural phenomenon, while in the 
second the probability is a subjective one, reflecting what the analyst is ready 
to bet on the outcome of an uncertain experience, in the sense of [34, 11]. 

The risk of shortcuts in mixing epistemic and aleatory uncertainties extends 
in fact to all Bayesian analysis [20, 22], though it is most apparent in the pre- 
dictive density. [14] for instance advocates the use of interval analysis as an 
alternative to propagate epistemic uncertainty, combined with a probabilistic 
modeling of intrinsically random variables. Such a hybrid approach has the ad- 
vantage of explicitly separating both sources of indetermination, and also avoids 
the issue of choosing a prior distribution. However it may raise other problems; 
for instance it is not clear wether the optimality properties of Bayesian estimates 
(see Section 2.3) are retained in this setting, or how a combination of probabil- 
ity distribution and intervals can be used in a decision-making process. Hence, 
though we acknowledge the risk of confusion inherent to modeling parametric 
uncertainty by a probability distribution, we argue that it is still perfectly feasi- 
ble to distinguish between both sources of uncertainty (see [21] for an example 
in the context of nuclear waste disposal). 

In the following, we show that an important guideline for avoiding such issues 
is to focus on the decisional framework underlying the uncertainty analysis, 
while avoiding justifications based on heuristic interpretations. 

4.3 Estimation of an expected value (including failure prob- 
ability) 

We now derive the HPE of the expected value of the output Y, that is, the 
integral of Y with respect to its density so that the quantity of interest is: 

4> = E[Y\9] = fyp(y\6)dy. 
Jy 

This is estimated by the predictive mean: 

0hpe = E[Y|£>] = [ y P (y\D)dy. 

However, recalling that p{y\D) is itself defined in (11) as an integral, and ex- 
changing the order of integration,we can re-write the predictive mean as: 

E[Y\D] = J^J yp(y\6)dy^(9\D)de = J<t>(fi)ir(fi\D)M 
= E[0|D]. 

In other terms, in this case <f> is simply estimated by its posterior mean! Fur- 
thermore, as recalled in Section 2.4, the posterior mean is the Bayes estimate 
associated to the quadratic loss function. Thus, applying the HPE approach to 
E[Y|0], we have actually performed a Bayesian estimation, implicitly assuming 
a quadratic loss, which may seem reasonable in this case. 



16 



A practical consequence of this result is that, if E[Y|0] is available in closed- 
form, than its posterior mean can be evaluated by a simple Monte-Carlo al- 
gorithm, requiring only a sample from 9's posterior distribution. Hence, the 
double Monte-Carlo procedure described at the end of the previous section is 
unnecessarily complicated in this case. 

The same reasoning applies in fact to any interest function <f> that can be 
expressed as the expectation of a function of Y. This includes the expectation 
of any power of Y, or the probability that Y exceeds a certain value t, which 
can be written as the expectation of l{y>t}- The problem of estimating such 
a probability, arises in the context of structural reliability where the failure of 
a system occurs when a given state variable is greater (or lower) than a fixed 
'safety limit' (see for instance the lifetime example presented in Section 2.1). 

We may therefore state the following theorem (also found in [9]), whose proof 
is given in Appendix A: 

Theorem 1 Heuristic predictive estimate of an expected value. 

The HPE of any quantity of interest that can be written under the form 
4> = E[h(y)\8] for a certain function h is the posterior mean: 

0HPE = E[<t>\D], 

that is, the Bayes estimate associated to the quadratic cost function. 

This result raises some concerns, since it suggests that the HPE is actually 
a Bayesian estimator in disguise. Moreover, this heuristic implicitly forces the 
choice of a particular cost function, which seems to depend on the expression 
of the quantity of interest, rather than on decisional aspects. For instance, as 
mentioned earlier in Section 2.4, when estimating the probability that an indus- 
trial component's lifetime exceeds a certain length, under and over estimations 
may have very different consequences, so we may want to use a dissymmetric 
cost function in this case rather than the quadratic loss. 



4.4 Estimation of a quantile 

Can the HPE always be interpreted as a Bayes estimator? In the case of quan- 
tiles, there is no simple answer to that question. 

To make this question more precise, consider <J) to be the quantile q a — q a {0) 
of order a of the density of Y, defined by: 

P[Y<g Q |0] = j qC "p(y\9)dy = a. (14) 

J — OO 

Then the HPE of q a is the corresponding predictive quantile $^ PE , such that: 
P[Y < Z FE \D] = [ " " p(y\D)dy = a. 

J — OO 

A natural question that arises at this point is wether g^ PE is a Bayes estimate 
of q a , that is, wether it can be defined as the quantity that minimizes the 
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posterior expected loss E[C(q a ,d)\D] (5), for a certain cost function C(q a ,d) 
(as defined in Section 2.2). Intuitively, this would mean that the HPE of q a , 
which is a characteristic quantity of the predictive density p{y\D) of a future 
observation Y, would also be a certain characteristic quantity of the posterior 
density 7r(<7 Q |_D) of quantile q a , such as its posterior mean, standard deviation, 
etc. Unfortunately, such is not the case, as we prove in Appendix B, meaning 
that q^ E is not a Bayes estimate of q a . 

Then what exactly is $^ PE ? The answer is that it can be interpreted as a 
Bayes estimator, but of another quantity than quantile q a . Indeed, from Sec- 
tion 2.4, 5^ PE is seen to be formally the Bayes estimate of Y relative to the 
weighted absolute loss: 

c a (y,d) = \y- d\ x (al {d<y} + (1 - a)l {d>y} ). (15) 

Note however that in the classical definition of a cost function, as recalled in 
Section 2.2, the interest quantity is a function <j)(0) of the model parameters, 
whereas here it is seen to be the value y taken by the future observation Y. 
So q^ FE is a Bayes estimator in a more general sense than the one specified in 
Section 2.3. 

The above interpretation of q^ PE as a Bayes predictor of the future observa- 
tion Y is reasonable when Y is central to the decision process. Such a situation 
is described in [13], in the context of dam conception. Here Y represents the 
yearly maximal water level of a river, and d the height of a dam to be con- 
structed next to the same river. Using the absolute weighted loss, the optimal 
dam height is seen to be precisely the HPE <?" PE , where a is the relative cost 
resulting from a flood (due to an undersized dam), when compared to the sum 
of the global costs (expected damages + sure investments). 

However, if the decision-maker needs to consider costs that are directly re- 
lated to the actual value of the quantile q a , then the HPE should not be used, 
since it addresses a decisional problem that has nothing to do with quantiles. A 
simple approach in this case would be for instance to adopt the quadratic cost 
function (1), in which case the Bayesian estimate of q a is its posterior mean. 

Alternative choices are proposed in [19], in the context of ecological risk 
assessment, the quantity of interest being the minimal concentration HC P of 
a certain chemical hazardous to p% of the species in a given habitat. This is 
expressed as the p-th percentile of the species sensitivity distribution (SSD), 
representing the distribution of tolerance values to the target chemical for a 
randomly sampled species within the studied habitat. It is argued that the use 
of the weighted absolute loss or the LINEX loss allows to rationally choose an 
estimator that optimizes the costs associated with over- and under-estimations. 
The weighted absolute loss yields a certain quantile of the posterior distribution 
of HC P , whereas the LINEX leads to a more complicated estimator, which has 
no explicit form but can be evaluated numerically. Note that, in this context, 
the HPE would be totally useless to estimate HC P , since it would be unable to 
account for the costs resulting from the different types of estimation errors. 

In conclusion, we summarize two important facts concerning the HPE of a 
quantile in the following theorem: 
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Theorem 2 (Heuristic predictive estimate of a quantile.) 

1. The HPE <t^ pe of a given quantile q a is not a Bayes estimate of q a ; 

2. q^ FE is the Bayes estimate of a future observation Y for the weighted ab- 
solute loss c a (y,d) (15). 

This is a striking illustration of how careful one must be when constructing 
an estimator in specifying both the quantity of interest and an expression, even 
imperfect, of the real costs associated to the decisions guided by the estimate, 
as stressed in [4]. Indeed, neglecting these steps may lead to estimators that are 
totally unrelated to the quantity of interest from a decisional viewpoint, and 
may thus under- or over- estimate resulting costs in an unpredictable way. 

5 Example: Dyke reliability estimation 

We now compare the MLE, HPE and Bayes estimates on a case study, con- 
cerning the safety evaluation of a flood protection dyke. There exists a rich 
literature on this subject, an overview of which can be found in [30]. Following 
the terminology introduced in this work, the following example uses the method 
of yearly maxima, meaning that the flood probability is estimated from a record 
of yearly maximal river discharges. 

The variable of interest Y is here the maximal water level, noted Z c in the 
following, of the river at the location of the dyke. Following [ ], we assume that 
Z c can be computed given a number of input variables, following the analytical 
formula: 

Z c = Z v +A-Q 3 ' 5 , (16) 

where: 

- Q is the yearly maximal water discharge (m 3 /s); 

- Z v is the riverbed level (m asl) at the downstream part of the river section 
under investigation; 

- A is a certain constant, depending on the Strickler friction coefficient, as 
well as on the slope, width and length of the river section. 

Additionally, we note h the height of the protection dyke, and consider the 
problem of estimating the probability of a flood, that is, the probability that 
the maximal water level Z c exceeds h. As in the light-bulb example discussed 
in Section 2.1, we assume that the estimated flood probability p will guide a 
certain decision. For instance, if p < 10 -3 , then the dyke is considered safe and 
no particular measure is taken. If 10 -3 < p < 10~ 2 , then the valley downhill of 
the dyke is declared a flood-risk area, and constructions are forbidden therein. 
If p > 10~ 2 , then enlarging the dyke is considered necessary. 
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Each decision p has a certain cost, depending on the real but unknown true 
flood probability p, and which can be described by a cost function C(p,p), as 
explained in Section 2.2. It is reasonable to assume for instance that the cost 
C(p = 10 _1 ,p = 1CP 4 ) resulting from under-estimating the flood probability, 
is more important than the false alarm costs C(p = 10~ 3 ,p = 5.10 -2 ), corre- 
sponding to the scenario where new constructions are needlessly banned from 
the vicinity of the dyke, or even C(p — 10~ 3 ,j5 = 10 _1 ), relative to the unnec- 
essary efforts to enlarge the dyke. 

The case presented here is of course an over-simplified toy example, meant 
for illustrative purposes only, and is not representative of the models used to 
assess hydrological risks in real-life industrial studies. 

5.1 Observation model and prior 

Further simplifying the case-study in [1], we assume that Z v and A are known, 
fixed quantities. The yearly maximum river flow Q on the other hand is intrin- 
sically random, and we suppose that a sample of n = 30 annual maximal values 
D = (gi, . . . , q n ) is available, and can be used to assess the uncertainty on Q. 
According to [30], n = 30 is considered as the minimal sample size allowing a 
reliable estimate of the flood probability. For the exemplary purposes of this 
paper, we chose to model the annual maxima of the river discharge according 
to the Weibull distribution W(r),(3), with scale parameter ij > and shape pa- 
rameter (3 > 0. This distribution has been proposed in the same context for 
instance in [ ], with an additional parameter Qq representing a minimal dis- 
charge. According to this model, the probability that Q exceeds a certain level 
t is given by: 



Given the simplified hydrological model (16) considered here, this means that 
the flood probability p we are interested in is related simply to the parameters 
(r/, /3), following: 



Following [6], we adopt a hierarchical form for the prior density, by defining 
n(rj\(3) as a generalized inverse Gamma (GIG) distribution, meaning that fj, = 
jj -1 /^ follows a Gamma distribution conditional on (3. We also choose a Gamma 
prior for /3, adding a lower bound to ensure existence of the posterior moments 
of r\ [36], so that: 



exp {-(«/„)"}. 



p = ¥[Z c >h\ V ,P] 

= ¥[Z V +A-Q 3 ^>h\ v ,f3} 




(17) 
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Here b{ 

m i P) — ni/m, 1 1 is a prior guess on the median annual flood, /3q is a 
prior guess on /3, and m is a virtual data size which measures the confidence in 
the prior information. 

5.2 Methods compared. 

We compared several estimates for the tail probability p. First, we considered 
the commonly used MLE p MLE , obtained by plugging the most likely parameter 
values (t?mle, Aile) into their unknown value in (17), together with the HPE 
p HPE , which, by a direct application of Theorem 1, is simply the posterior mean 
E\p\D] . 

But, as discussed in Section 2.2, we are less interested in the precise value 
of p than by its order of magnitude, that is, — log 10 p. Thus, we computed 
the Bayes estimate Pbay.i relative to the log-quadratic loss (3) (equivalently, 
-lo gioPBAY.i is the Bayes estimate of — log 10 p using the quadratic loss (1), 
that is, the posterior mean E[— \og 10 p\D]). 

Finally, the above Bayes estimate equally penalizes over- and under-estimation 
of the log-tail probability, even though the risks associated to each type of error 
are very different. Indeed, under-estimating the probability of a flood can result 
in huge environmental, economic and human cost, compared to the safety mea- 
sures discussed in Section 5. Thus we also considered the Bayes estimate p B ay,2 
relative the log-weighted absolute loss, meaning that — log 10 j5 BA y.2 is the Bayes 
estimate relative to the weighted absolute loss (2). As recalled in Section 2.4, 
this is simply the a-th quantile of the posterior distribution of — log 10 p, where 
a = C\/{Cx+Cz) measures the relative cost of under-estimating the flood prob- 
ability order of magnitude. We chose a = 0.1, meaning that we considered that 
under-estimating the flood probability was 9 times as costly as over-estimating 
it. 

5.3 Results. 

Figure 4 shows the values of the above estimators, when applied to a simulated 
dataset. The MLE was evaluated numerically using a standard Newton-Raphson 
algorithm to maximize the likelihood function, while a sample of 10 6 draws 
from the parameters' joint posterior distribution was obtained using importance 
sampling (IS), and used to compute Monte-Carlo approximations of all Bayesian 
estimates. 

In this particular example, the MLE (p ULE = .004), the HPE (p B ay,2 = -009) 
and the Bayes estimate relative to the log-quadratic loss (p B ay,i = -006) all esti- 
mated the flood risk to be smaller than 10 -2 , even though the true probability 
was: p — .013. According to the decisional setting described previously, based 
on these estimates the valley downhill from the dyke would be declared uncon- 
structible. However, the danger of a flood occurring in less than 100 years would 
go unforeseen. 

In contrast, only the Bayes estimate relative to the log- weighted absolute 
loss (pbay,2 = -017) correctly evaluated the flood probability as exceeding 10 -2 , 
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Figure 4: Top: n = 30 maximal yearly water levels, computed using the hydrological 
model (16), from discharges simulated using the W(r] — 1000,/3 = 2) distribution. 
The horizontal line corresponds to the dyke height, d = 53.1m. Bottom: Different 
estimates of a flood probability, on a logarithmic scale. The dyke height d = 53.1m 
corresponds to a true flood probability of p = .013. Prior bet on /3 was taken equal 
to Po = 1.5, and t e defined as the median of the Weibull distribution W(/3o, Vo), with 
prior scale rjo = 800. Limited weight was given to this prior information, by choosing 
m — 1, explaining why the prior 7r(— log 10 p) appears flat. 
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suggesting the necessity of re-sizing the dyke. 

Hence the safest estimate (from a decisional point of view) in this case was 
Pbay,2- This is hardly surprising, since it was the only one based on a loss function 
that explicitly favored overestimation of the quantity of interest, through precise 
quantification of the relative cost resulting from the dyke being flooded, when 
compared to the establishment of safety measures, i.e., enlargement of the dyke, 
or evacuation of the houses in the downhill valley. Because these costs are very 
different, it is important to consider a method of estimation which takes them 
into account. 

6 Discussion 

We have compared the main paradigms used to perform statistical inference in 
industrial studies, namely Bayesian inference, maximum likelihood estimation 
(MLE) and heuristic predictive estimation (HPE). This comparison was con- 
ducted with reference to decision theory, since industrial studies usually guide 
a certain decision. The Bayes paradigm is known to be optimal from this per- 
spective, hence we used it as a gold standard to benchmark both MLE and 
HPE. 

MLE is by far the most popular approach due to its simplicity, however it 
is justified only when a large number of observations are available. Its main 
weakness is that it ignores the uncertainty on the unknown parameter, making 
it unreliable when only a few observations are available. 

HPE represents a definite improvement over the MLE in that it explicitly 
accounts for parameter uncertainty by averaging the output's density over all 
possible values of the parameter, weighted by their posterior probabilities. How- 
ever, we have shown that, for a certain number of interest quantities, HPE is 
in fact equivalent to Bayes estimation, and corresponds implicitly to a certain 
choice of cost function, depending entirely on the mathematical expression of 
the chosen quantity of interest. 

This raises several concerns. First, why use the term 'predictive estimation' 
when actually Bayes estimation is performed? Second, cost functions should be 
chosen based on an evaluation of the costs associated with potential estimation 
errors, rather than the particular expression of the quantity under study. 

In other terms, the major drawback of HPE is that it robs the user of the 
freedom of specifying exactly what estimation problem he wishes to solve, by 
implicitly choosing for him the cost function to be minimized. Furthermore, 
we have seen that interpreting the results of HPE is delicate when one steps 
outside the realm of Bayesian rationale, since the predictive density has no 
phenomenological interpretation, and should only be interpreted in terms of 
probabilistic bets. 

In conclusion, in view of the increasing need to assess uncertainties in in- 
dustrial studies, and the ever more pressing requirement of accurate cost as- 
sessment, the Bayesian framework seems to provide the most useful unified 
framework. Indeed, it relieves the engineer from the difficult choice of an appro- 
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priate methodology, allowing him/her to concentrate on the really important 
questions underlying the study, which are the quantification of the sources of 
uncertainties, and that of the potential costs to be controlled in the decision- 
making process. Furthermore, adopting such a unified framework would make 
it much simpler to compare results from different studies, an important issue 
stressed in [31]. 

It does imply some conceptual and technical difficulties, such as choosing a 
prior distribution, or computing the joint posterior distribution of all unknown 
variables, which may explain why the Bayesian paradigm has not yet been 
universally adopted. However, many solutions exist nowadays to tackle these 
issues. See [27] for instance for an excellent overview of the current state of the 
art in Bayesian estimation techniques, and [5] for their application to engineering 
case studies. When comprehensive Bayesian approaches are unpractical, it is 
also possible to adopt simplifying strategies, such as in [25, 26]. These consist in 
focusing the Bayesian treatment on the most important sources of uncertainty 
with respect to the decisional problem considered, treating other sources by more 
conventional methods, such as MLE. Hence we have every reason to believe that 
more and more industrial studies will make use of the powerful tools of decision 
theory and Bayesian analysis [3] to conduct as simply and efficiently as possible 
the increasingly complex studies required by an increasingly complex world. 
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A Proof of Theorem 1 

By assumption, there exists a function h, such that the quantity of interest <f> 
can be written as: 

= E[fc(Y)|0] = [ h(y)p(y\9)dy. 
Jy 
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This is estimated using the predictive heuristic by substituting p(y\D) for the 
unknown density p{y\9) : 



f h(y)p(y\D)dy. (18) 
Jy 



Expanding the expression of p(y\D), as defined in (11), and exchanging the 
order of integration then yields: 



%) J p(y\9)dyjir(6\D)de = j <j>{0)*(0\D)M 
E[<t>\D] 

a 



B Proof of Theorem 2 

Reasoning by reductio ad absurdum, suppose that q^f E is the Bayes estimate 
of q a , relative to a certain cost function C(q a ,d), that is, the quantity that 
minimizes E[C(q a , d)\D], which we can write formally as: 

^HPE 



q^ PE = argmin / C{z,d)-K qa \ D (z)dz, 

d J z 

where we note iTq a \D{z) the posterior density of quantile q a , given data D and 
prior 7r. The right member of the above display defines a functional H whose 
domain is the space of all possible posterior densities of q a , that is, the space of 
all probability densities h(-) on M. When h coincides with the posterior density 
KqalD of a certain quantile q a , then we have: S(7T 9ci |d) = $^ PE . 

We now show that the existence of such a functional leads to a contradiction 
in the exponential model. Using the notations introduced in the example in 
Section 2.1, the quantile q a is given by: 

q a = eiog-*—. (19) 
1 — a 

Defining the usual conjugate inverse-Gamma prior on 9 : 

tt(0) = 2G(9-n ,S ) (20) 

_ °0 a-nO-1 -So/8 

r(no) 

it is well known that the posterior distribution of 9 is the inverse-Gamma dis- 
tribution IQ{riQ + n, So + S n ), where n is the data size and S n the sum of all 
observations: S n = X\ + . . . + x n . Hence, the posterior distribution of q a is also 
an inverse-Gamma: 

q a \D ~ ig(n + n,\og(-^—)x(S + S n )], (21) 
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and elementary calculations show that the HPE is equal to: 

Z PE = (e log( T^ )/( " 0+n) - l) x {S + S n ). (22) 

Note that, by definition of the functional H, the right member of (22) is the 
image by H of the posterior density defined by (21). The proof proceeds from 
here by showing that log( yz^) itself is the image by a certain functional of the 
posterior density (21), i.e., that one can recover the order a of a quantile from 
the quantile's posterior distribution only! This is of course absurd since we can 
see from (21) that the same posterior scale parameter value log (rz^j x (So+S n ) 

can derive from infinitely many possible couples of values for a and (So + S n ), 
meaning that two quantiles of different orders may have the exact same posterior 
distribution if computed on two different datasets, or for two different values of 
the prior scale So- 

But it is easy to show that ^og(j^) can be deduced from <^, PE , (no + n) 

and log (jz^j x (So + S„). Since the shape and scale parameters of an inverse- 
Gamma distribution can be expressed as simple functions of its mean and vari- 
ance, (no + n) and log ( jzr^j x (So + S n ) can also be written as the images by 

certain functionals of the posterior density (21). As a result, a can be obtained 
from (21) only, which we have shown to be impossible 

□ 
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